(12) United States Patent 

Kothuri et al. 



(54) HEIRARCHICAL INDEXING OF MULTI- 
ATTRIBUTE DATA BY SORTING, DIVIDING 
AND STORING SUBSETS 

(75) Inventors: Ravi Kothuri, Nashua, NH (US); Siva 
Ravada, Tewksbury, MA (US); Jayant 
Sharma; Jayanta Banerjee, both of 
Nashua, NH (US) 

(73) Assignee: Oracle Corporation, Redwood Shores, 
CA (US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 09/322,901 

(22) Filed: May 29, 1999 

(51) Int. CI. 7 G06F 17/30 

(52) VS. CI 707/100; 707/2; 707/4; 

707/7; 707/102 

(58) Field of Search 707/1-8, 10, 100-104.1, 

707/200-205; 709/201, 204, 205, 222-223 

(56) References Cited 

U.S. PATENT DOCUMENTS 



5,404,512 A * 4/1995 Powers et al 707/3 

5,647,058 A * 7/1997 Agrawal et al 707/1 

5,701,467 A * 12/1997 Freeston 707/100 

5,706,503 A * 1/1998 Poppen et al 707/100 

5,752,243 A * 5/1998 Reiter et al 707/3 

5,781,906 A ♦ 7/1998 Aggarwal et al 707/102 

5,905,985 A * 5/1999 Malloy et al 707/100 

5,943,668 A * 8/1999 Malloy et al 707/3 

5,943,677 A * 8/1999 Hicks 707/205 

5,974,407 A * 10/1999 Sacks 707/2 

5,978,796 A * 11/1999 Malloy et al 707/3 

6,003,036 A * 12/1999 Martin 707/102 

6,122,628 A * 9/2000 Castelli et al 707/5 



III Ellllll II II J 111^ lllll Hill II II Hill II y HIS llilll IIB I II llll 

US006381605B1 

(10) Patent No.: US 6,381,605 Bl 
(45) Date of Patent: Apr. 30, 2002 



6,182,060 Bl * 1/2001 Hedgcock et al 707/1 

6,223,182 Bl * 4/2001 Agaiwal et al 707/102 

FOREIGN PATENT DOCUMENTS 

EP 0887758 A2 * 12/1998 17/60 

WO WO-01/04795 Al * 1/2001 17/30 

OTHER PUBLICATIONS 

Jurgens, M et al., The R/sub a/*-tree:an improved R*-tree 
with materialized data for supporting range queries on 
OLAP data, database and expert systems applications 1998 
proceedings, ninth international workshop Aug. 26-28, 
1998, pp. 186-191 * 

(List continued on next page.) 

Primary Examiner — -John Breene 

Assistant Examiner — Srirama Channavajjala 

(74) Attorney, Agent, or Firm — Park, Vaughan & Fleming 

LLP 

(57) ABSTRACT 

A system and method for indexing and storing multi- 
dimensional or multi-attribute data. Data items are recur- 
sively sorted in a selected dimension (e.g., the dimension 
having the greatest variance) and divided until each subdi- 
vision fits into a leaf node having a specified fanout. 
Intermediate nodes and a root node are constructed to 
complete the index. Each node of the index is stored in a 
database as a separate object or record and may include a 
node identifier of the unique, an identifier of a parent and/or 
a sibling node and an entry for each child of the node, which 
may be data items or other nodes. Each record entry for a 
child includes an associated bounding area encompassing 
descendant data items. Another database table or module 
may store information about the index, such as the dimen- 
sionality of the data, the index fanout and an identifier of a 
root of the index. 

27 Claims, 9 Drawing Sheets 
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HEIRARCHICAL INDEXING OF MULTI- 
ATTRIBUTE DATA BY SORTING, DIVIDING 
AND STORING SUBSETS 

Oracle, Oracle Server, and all Oracle Server-based trade- 
marks and logos are trademarks or registered trademarks of 
Oracle Corporation in the United States and other countries. 

BACKGROUND 

This invention relates to the field of database management 
systems. More particularly, a system and methods are pro- 
vided for indexing multi-dimensional data, storing such data 
in a relational database management system and efficiently 
retrieving the data upon demand. 

Various methods of managing collections of data (e.g., 
databases) have been developed since data was first stored in 
electronic form. From initial systems and applications that 
simply collected data in one or more flat database files to 
present sophisticated database management systems 
(DBMS), different solutions have been developed to meet 
different requirements. Early solutions may have had the 
advantage of simplicity but became obsolete for a variety of 
factors, such as the need to store large — even vast — 
quantities of data, a desire for more sophisticated search 
and/or retrieval techniques (e.g., based on relationships 
between data), the need to store different types of data (e.g., 
audio, visual), etc. 

A database may be considered distinct from a DBMS, in 
that a DBMS, as the name implies, includes utilities for 
accessing, updating and otherwise managing or operating a 
database. As the amount and complexity of data stored in 
databases has increased, DBMS design and development 
efforts have increasingly focused upon the ability to 
organize, store and access data quickly and efficiently. As a 
result, today's database management systems can be very 
effective in managing collections of linear information such 
as inventory, customer lists, etc. 

With such linear (or uni-dimensional) data — data that 
varies in value in a single dimerision— determining and 
maintaining relationships between data elements is rela- 
tively easy. For example, one value or data point (e.g., a 
price, a quantity) can easily be compared to another to 
determine which is "greater" or which ones match a par- 
ticular query. The ordinal nature of linear data therefore 
readily lends itself to basic indexing and subsequent storage, 
search, retrieval and other data management operations. Id 
particular, the appropriate point for insertion, retrieval or 
deletion of a data element in a database of linear data may 
be found with great facility by referring to a table or other 
data index. 

In short, today's database management systems have been 
designed to manage linear data very effectively. Present 
database management schemes are still poorly suited, 
however, for managing data that are multi-dimensional in 
nature. Geographic data, for example, may be meaningful 
only when expressed in at least two dimensions (e.g., 
latitude and longitude) and can thus be considered "inher- 
ently" multi-dimensional. Because such data can vary in 
value in more than one dimension, the relationship between 
selected geographic points is more complex and, unless a 
particular reference dimension or other criteria is chosen, 
one point cannot automatically be considered "greater" or 
"less" than another. The difficulty in expressing relations 
among sets of inherently multi-dimensional data makes 
indexing such data (e.g., for storage and retrieval) more 
complicated. 
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Closely related to inherently multi-dimensional data is 
multi-dimensional data that may also be termed "multi- 
attribute" in nature. Multi-attribute data may be defined as 
information that possesses multiple-defining characteristics 

5 that are not inherently related. For example, sales data may 
be considered multi-attribute in nature if characteristics such 
as time of sale, region of sale, salesperson, product sold, 
type/model of product, etc. are recorded for each sale. This 
data becomes multidimensional or multi-attribute in nature 

0 when queries are made or reports are desired that specify 
range predicates in two or more of the data attributes (e.g., 
the sales made by Salesperson A during the previous month, 
the best time of year for selling a particular product). 
Today's database management systems are, unfortunately, 

15 not designed to organize this data in a manner that enhances 
the ability to retrieve those data items satisfying a particular 
multi-dimensional query pattern 

Present techniques for dealing with (e.g., indexing, 
storing, retrieving) multi-dimensional data often involve 

2 q attempts to translate the data into a single dimension so that 
existing systems may be used. These techniques often fail to 
maintain important spatial relationships, however, thus 
adversely affecting the ability to respond rapidly to multi- 
dimensional queries. For example, linear quadtrees and 

25 Hilbert R-trees transform multi-dimensional data into a 
single dimension and then construct B -trees on the linear- 
ized data. Although these schemes may be adequate for 
two-dimensional data, linearizing data having three or more 
dimensions may result in an unacceptable loss of spatial 

30 relationship. 

Meanwhile, the number and types of applications that use 
multi-dimensional and multi-attribute data — such as geo- 
graphic information systems (GIS) and computer-aided 
design and manufacturing (CAD/CAM) systems — continue 

35 to grow. A GIS application may work with maps or charts in 
which spatial data are expressed in two or three dimensions 
(e.g., latitude, longitude and, possibly, altitude). Similarly, in 
CAD/CAM applications products such as printed circuit 
boards for computer systems may be designed using rect- 

40 angular areas or cubic volumes. A person using one of these 
applications may select an area of interest that contains or 
intersects one or more elements. The application must be 
able to accurately identify those elements and allow ready 
modification of the data. 

45 Multi-media applications, in which audio and visual ele- 
ments are combined in one database, are another area that 
can benefit from efficient storage of multi -dimensional data. 
For example, an element of a graphical image may require 
multi-dimensional data for accurate representation. In 

50 particular, graphic images may be described in terms of 
numerous spectral characteristics (all or some of which may 
be inherently inter-related). Thus, in a graphical element that 
embodies a combination of colors (e.g., some mixture of red, 
green and blue pixels on a computer display) the different 

55 colors of the element may be represented in different dimen- 
sions. Each dimension's value could, for example, represent 
the relative proportion of the corresponding color within the 
element or the number of pixels within the element that have 
that color. Accurate representation of this data would allow 

60 an application to easily identify elements that similar col- 
oring. 

Because effective methods of organizing multi- 
dimensional data in a DBMS have been generally unknown, 
applications that use such data have been unable to reap the 
65 advantages offered by today's database management 
systems, especially relational database management systems 
(RDBMS). Among those advantages are backup and recov- 
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ery utilities, robust security, and controls allowing concur- identifier of the child, which may be a data item (if the node 

rent data access. Developers of these applications have had is a leaf node) or another node. Each record also includes a 

to rely upon other methods of indexing and storing such bounding region or area that encompasses the data item (if 

data. the node is a leaf node) or all data items that descend from 

What is needed then is a method of organizing multi- 5 the node (i.e., all data items below the node that are 

dimensional/multi-altribute data in a DBMS, particularly a connected to the node through one or more intervening 

relational DBMS, in order to reap the advantages of sophis- nodes). 

ticated management controls (e.g., concurrent access to the In one embodiment of the invention in which a user's 

data) without sacrificing spatial relationships. query will likely match (within a range of exactitude) one of 

Advantageously, such a DBMS would provide for efficient 10 a set of known query patterns, the data items may be 

organization of the data to facilitate its rapid retrieval. clustered for indexing in an appropriately corresponding 

Retrieval of the data may be enhanced by applying an manner. Thus, if one query pattern expresses a particular 

effective buffering technique. order of hierarchy between the dimensions/attributes of the 

data, the data items may be divided and clustered accord- 

SUMMARY 15 m giy m order to create an index tailored to providing an 

In one embodiment of the invention a system and methods efficient response to an actual query. Multiple indexes may 

are provided for storing a hierarchical index of multi- thus be created (and stored in a database) for a given set of 

dimensional data in a relational database management sys- data items. 

tern (RDBMS). DESCRIPTION OF THE FIGURES 

In addition to providing for effective management of data 
that is inherently multi-dimensional (e.g., geographic, multi- FIG " 1 15 a block depicting a set of multi- 
media), this embodiment also provides for the storage and dimensional data items and one R-tree index that may be 
management of linear data that has multiple attributes. For constructed to index the data items in accordance with an 
example, a database of sales figures may be indexed accord- embodiment of the present invention, 
ing to attributes such as product, time, salesperson, geo- 25 FIG. 2 is a flowchart illustrating one method of storing a 
graphic region, etc. hierarchical index of multi-dimensional data in a relational 
In an embodiment of the invention, a set of multi- database m accordance with an embodiment of the inven- 
dimensional/multi- attribute data items is indexed by recur- 11011 * 
I sively dividing the data items into smaller clusters until each „ FIG* 3 depicts the division of a set of multi -dimensional 
\ cluster can be stored (i.e., indexed) in a single leaf node of data items into clusters, for storage in R-tree leaf nodes, in 
a hierarchical (e.g., tree -structured) index. In this accordance with an embodiment of the present invention, 
embodiment, when the set of data items or a subset thereof FIG. 4 is one R-tree index that may be constructed from 
is too large to fit in a single leaf node, a suitable dimension/ the set of multi-dimensional data items depicted in FIG. 3 in 
attribute in which to divide the data items is selected. The 35 accordance with an embodiment of the present invention, 
capacity of a node may be specified as a fanout characteristic FIG. 5 is a flowchart illustrating one method of indexing 
of the index or may be determined by a parameter of a a set of multi-dimensional data in accordance with an 
suitable physical storage device. embodiment of the invention. 

A dimension or attribute in which to divide the data may FIG. 6A depicts the addition of a data item to a set of 

be selected on the basis of which one consists of data item 40 multi-dimensional data items in accordance with an embodi- 

values having the greatest variance or range. Alternatively, ment of the present invention. 

a dimension may be selected based upon an expected or FIG. 6B depicts an R-tree index resulting from the 

specified query pattern. When a dividing dimension is expanded set of multi-dimensional data items of FIG. 6A in 

selected, the data items may be sorted in that dimension and accordance with an embodiment of the present invention, 

then divided into two or more subsets that contain equal or 45 pIG. 7A depicts a nearest neighbor query in a set of 

nearly equal numbers of data items. After leaf nodes are multidimensional data items in accordance with an embodi- 

constructed for clusters of data items, intermediate and, ment of me prcsent invention. 

finally, a root node may be constructed to complete the F , G ^ mustrates one r.^ iodex that ^ ^ 



index. 



structed from the set of multi-dimensional data items 



In one embodiment of the invention, a hierarchical index 50 depicted in FIG. 7 A in accordance with an embodiment of 

(e.g., an R-tree index) of multi-dimensional or multi- me preset invention 
attribute data may be stored in a database, such as a 

relational database management system. In this embodiment DETAILED DESCRIPTION 

a first object or table in the database is configured to store The following description is presented to enable any 

information concerning the index (e.g., its dimensionality, 55 person skilled in the art to make and use the invention, and 

fanout) and possibly an identifier (e.g., an address or storage is provided in the context of particular applications of the 

location, a unique node identity) of a root node of the index. invention and their requirements. Various modifications to 

A second object or table is configured to store a record or the disclosed embodiments will be readily apparent to those 

row for each node of the index. The multi-dimensional data skilled in the art and the general principles defined herein 

items may be stored in one or more objects or tables, in the eo may be applied to other embodiments and applications 

same or a different database. without departing from the spirit and scope of the present 

In the second object or table, each record for an index invention. Thus, the present invention is not intended to be 

node may consist of items such as: a unique identifier of the limited to the embodiments shown, but is to be accorded the 

corresponding node, an identifier of a parent node, an widest scope consistent with the principles and features 

identifier of a sibling node, a measure of the number of 65 disclosed herein. 

children of the node, and an entry for each child. In one The program environment in which a present embodiment 

embodiment of the invention each child entry includes an of the invention is executed illustratively incorporates a 
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general-purpose computer or a special purpose device such 
as a hand-held computer. Details of such devices (e.g., 
processor, memory, data storage and display) are well 
known and are omitted for the sake of clarity. 

It should also be understood that the techniques of the 
present invention might be implemented using a variety of 
technologies. For example, the methods described herein 
may be implemented in software executing on a computer 
system, or implemented in hardware utilizing either a com- 
bination of microprocessors or other specially designed 
application specific integrated circuits, programmable logic 
devices, or various combinations thereof. In particular, the 
methods described herein may be implemented by a series of 
computer-executable instructions residing on a storage 
medium such as a carrier wave, disk drive, or computer- 
readable medium. Exemplary forms of carrier waves may 
take the form of electrical, electromagnetic or optical signals 
conveying digital data streams along a local network or a 
publicly accessible network such as the Internet. 
Introduction 

A method and apparatus are provided for organizing 
multi-dimensional data in a relational or object-relational 
database management system, such as Oracle® Server by 
Oracle Corporation. In particular, methods and apparatus are 
provided for indexing multi-dimensional data items in a 
hierarchical index, storing the index in a database, and 
performing various operations on the index and/or data 
items. 

Embodiments of the invention described herein may be 
used to efficiently store, organize, manipulate and retrieve 
data for applications in the areas of geographical informa- 
tion systems (GIS), computer-aided design and computer- 
aided manufacturing (CAD/CAM), data warehousing, 
multi-media, etc. Various types of multi-dimensional data, 
such as geometrical, geographical, rectangular (e.g., ele- 
ments of a CAD/CAM project), and geoimage data, plus 
data representing image, audio and video feature vectors and 
data possessing multiple attributes, may thus be stored or 
indexed in a database in one or more embodiments of the 
invention. In particular, the data items that are indexed and 
manipulated in an embodiment of the invention may be 
point data or non -point data (e.g., spatial in nature). 

An embodiment of the invention may be installed as an 
integral part of any database server, whether part of a 
centralized computing environment, client-server environ- 
ment or otherwise. The index and/or data may, for example, 
reside on a client in order to minimize the expense of 
network activity and access to storage devices. The process- 
ing and input/output burdens caused by single/multiple 
operations on the index may be shared using single/ 
concurrent threads running on single or parallel processor 
machines operating with single/multiple storage devices. 

For purposes of embodiments of the invention described 
herein, multi-dimensional data may possess any number of 
dimensions (e.g., two or greater). Multi-dimensional data 
may also include data that possess multiple attributes or 
values in multiple dimensions that are not inherently related. 
For example, geographic data is often expressed in terms of 
latitude and longitude (or some other coordinate system). 
Geographic data may thus be considered inherently multi- 
dimensional because a latitude or longitude value, by itself, 
cannot adequately identify a geographic feature. 

Embodiments of the invention described herein are, 
however, also effective with data that are not inherently 
multi-dimensional but which may be accessed or manipu- 
lated on the basis of multiple attributes or values. For 
example, sales data may have attributes such as time (e.g., 
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time of sale), product (e.g., item sold) and area (e.g., region 
in which a sale occurred); Although sales information may 
be retrieved using just one of these attributes, reports may be 
desired that specify range predicates on two or more of these 

5 attributes (e.g., all sales of a particular product in a particular 
region during a specific time period). Although a basic sales 
data item may not be inherently multi-dimensional, storing 
or indexing it so as to facilitate its multi-dimensional 
manipulation may allow more efficient use of the data. Thus, 
the term multi-dimensional may be used herein to refer to 
multi-dimensional and/or multi -attribute data. 

Within a particular data dimension or attribute, a hierar- 
chy of selectivities or granularities may be specified. In 
various embodiments of the invention, one or more indexes 
may be constructed and stored in a database for a set of data 

15 items by choosing different selectivities for one or more 
dimensions or attributes. For example, a region dimension of 
sales data may include values for both city and state. A first 
index may then be constructed on the basis of city-level 
selectivity in the region dimension and, for example, a year 

20 selectivity in a time dimension. Another index may be 
constructed using state and year selectivities. 
R-Trees 

One embodiment of the invention provides a method and 
apparatus for efficiently storing and maintaining an R-tree 

25 index in a database management system (DBMS). An R-tree 
index stored within a database in this embodiment supports 
virtually all update operations (e.g., addition or deletion of 
a data item, database reorganization) and query operations, 
plus other database operations such as backup, recovery, 

30 security, etc. 

R-trees are often used to index data for applications such 
as those listed above because the R-tree structure has been 
found to be more effective for use with multi-dimensional 
data than many alternatives. An R-tree may index data 

35 having almost any number of dimensions but in order to 
provide the most efficient response to queries on the indexed 
data, it may be desirable to limit the dimensionality of the 
data to the range of two to approximately ten. 

Previous methods of storing R-trees often used file-based 

40 systems in which an R-tree was stored among several files, 
external to a database environment. Because the tree is not 
part of a database in such methods, such solutions preclude 
the ability to apply standard tools and utilities for database 
update, retrieval, security, concurrent access, etc. Some 

45 attempts have been made to store R-trees within the frame- 
work of a database system, but typically did so by storing an 
R-tree as a binary large object (BLOB) or an external file. 
With a BLOB, one R-tree node cannot be distinguished from 
another when accessing or manipulating the R-tree. 

50 Another method of indexing multi-dimensional data 
involved normalizing the data to a single dimension and then 
indexing it with a B-tree structure. This solution is inad- 
equate because of the resultant loss of spatial proximity, 
which can adversely affect the response time for queries. 

55 An R-tree suitable for storage in a database management 
system in a present embodiment of the invention consists of 
a root node and any number of leaf nodes, which may be 
connected to the root through a suitable quantity of inter- 
mediate nodes. The number of child nodes that the root or an 

60 intermediate node may have, and the number of data items 
a leaf node may have, is determined by the R-tree*s fanout, 
which is represented herein as M. M therefore denotes the 
capacity of an R-tree node. M may be any value greater than 
or equal to two, but for two-dimensional data a suitable 

65 value is approximately twenty-five. In general, values for M 
in the approximate range of ten to fifty work well with 
various embodiments of the invention. 
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Each data item indexed in this R-tree, and each region or 
collection of data corresponding to a subtree within the 
R-tree, is associated with a minimum bounding area, or 
MBA, that is defined in d dimensions (where d is the 
dimensionality of the indexed data) and that encompasses 5 
the corresponding data. Entries within a node (for either a 
child node or a data item) may be of the form <child, MBA>, 
where child is a pointer to or other identifier of a child node 
(for an intermediate or root node) or a data item (for a leaf 
node). 1Q 

Illustratively, the MBA portion of an entry comprises a 
definition of a boundary, the scope of which depends upon 
the type of entity of the associated child. If the child is a 
node, the MB A encompasses all of the data items within the 
subtree rooted at the child node (i.e., data items depending 
from the child node and/or all descendants of the child 15 
node). If the child is a data item, then the MBA simply 
encompasses that data item. .The different areas or regions of 
a dataset represented by different MBAs may overlap. 

In an R-tree that stores two-dimensional geographic data 
or the multi-attribute sale data described above, for example, 20 
an MBA may consist of a specification or definition of a 
two-dimensional shape (e.g., a rectangle) enclosing the 
points, areas or data associated with a child subtree. The 
shape may be specified as a set of geographic locations or 
points which, when connected, define a bounding area. An 25 
MBA may alternatively be expressed as a series of intervals 
or segments, one for each dimension in which the data has 
an indexed attribute or value. In general, therefore, an MBA 
may be expressed as a series of arrays or vectors, each array 
having d numbers or values (where d is the dimensionality 
of the data). 

Depending upon the database management system used to 
store an R-tree, special data types may be defined for the 
indexed data. For example, one version of the Oracle® 
Server database includes one or more data types designed to 
support geoimages and/or geometric data. 

In one embodiment of the invention, each node of an 
R-tree index includes the following fields, where applicable, 
prior to storage in a database: 

40 



Level Height of the node (e.g., leaf nodes are at level one, 

their parents are at level two, etc) 
Node_id Unique node identifier 

Child_count Number of child entries in the node 

Sibling Identifies (e.g., via a pointer) the node's right-hand 45 

sibling 

Parent Identifies (e.g., via a pointer) the node's parent 

(primarily used for root nodes that split because of 
data insertion) 

Parent__node_id Node_id of parent node (primarily used for root 

nodes that split because of data insertion) 50 

Entries Array (of up to M two-part entries), where each 

entry is of the form cchild, MBA> 



30 



35 



If R-tree 150 is assumed to have a fanout of two, root node 
151 cannot store all four data rectangles, thus requiring leaf 
nodes 161, 162. In the illustrated embodiment, when con- 
structing R-tree 150 for data rectangles 112, 114, 122, 124, 
the data rectangles are first clustered into MBAs. Because 
the four data rectangles may be stored in just two leaf nodes, 
it is sufficient to create two clusters. Thus, minimum bound- 
ing area Bl is defined to include rectangles 112, 114, which 
are stored in leaf node 161, while MBA B2 is defined to 
include rectangles 122, 124, which are stored in leaf node 
162. Root node 151 thus comprises two entries: <node 161, 
Bl> and <node 162, B2>. Note that the MBAs of R-tree 
nodes may or may not overlap. 
Storing an R-Tree Index in a Relational Database 

This section introduces a suitable method and form for 
storing an R-tree index in a relational database management 
system in one embodiment of the invention. One skilled in 
the art will appreciate that by storing an R-tree within a 
database management system, the R-tree may be manipu- 
lated using normal database tools and operations. In 
particular, because the R-tree exists within the database 
framework, it receives the benefits of database features such 
as security, concurrency control, backup and recovery, etc. 

In this embodiment three database objects are employed 
to store an R-tree: an INDEX table, a METADATA table and 
a Node__id generator. In addition, individual data items 
(e.g., sales data, geographical coordinates, other multi- 
dimensional data) may be stored in a separate DATA table 
(e.g., an object-relational table). The composition of 
INDEX, METADATA or DATA tables discussed herein may 
be modified (e.g., by adding or removing columns or 
attributes), merged or divided to suit a particular application 
or environment without exceeding the scope of the present, 
invention. 

Illustratively, an INDEX table stores the nodes of an 
R-tree index, while a METADATA table stores information 
about the R-tree itself (e.g., dimensionality, identity of root 
node, fanout). In this embodiment, each row of the INDEX 
table corresponds to a separate node of the R-tree. A 
Node__id generator (which may be created at the time the 
R-tree index is created) generates unique Node_ids for 
nodes of the tree. Besides the Node^td, however, a node or 
an entry in a table may be referenced or accessed by its 
Row_jd (i.e., the location of the entry on a storage device 
such as a disk drive, memory, etc.), a pointer or other 
identifier. 

In one embodiment of the invention an INDEX table of a 
database includes one or more of the following columns or 
attributes, where applicable, for each node of an R-tree 
index: 



Illustratively, the child element of a root or intermediate 
node entry in the Entries array is an identifier (e.g., a pointer, 55 
Node_id, disk location) of a child node. For a data item 
entry (i.e., a child of a leaf node), the child element is an 
identifier of an individual data item. 

FIG. 1 depicts two-dimensional dataspace 100 containing 
data items (a first data item is identified by the reference 60 
numeral 102a) and a representative R-tree that may be used 
to index the data in one embodiment of the invention. In 
FIG. 1, four data rectangles are depicted having reference 
numerals 112, 114, 122 and 124. The data rectangles may 
represent areas within a CAD/CAM project, areas of a 65 
geographical map, portions of geoimages, collections of 
sales data expressed in two dimensions, etc. 



Level Height of the node (e.g., leaves are at level one, 

their parents are at level two, etc) 
Node_id Unique node identifier 

Row_id Location of node in storage device 

Child_count Number of child entries in the node 
Sibling Identifier of the node's right-hand sibling 

Parent__node_id Node_id of parent node (primarily used for root 

nodes that split because of tree update) 
Parent_row_id Row_id of parent node (primarily used for root 

nodes that split because of tree update) 
Entries Array of two-part entries (up to M in size), each 

entry being of the form <child, MBA> 



An illustrative METADATA table may comprise separate 
rows for each R-tree index and include the following col- 
umns or attributes for each index: 
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Dimensionality Dimensionality of the index 

Root Identifier of the root node in an INDEX table 

Fanout Fanout of the R-tree index 



As described previously, nodes and data items may be 
identified in many different manners — such as by their 
Row_jd, Node^id, a pointer, etc. In particular, the Sibling 
and Parent columns of an INDEX table and the Root column 
of the METADATA table may identify nodes by their 
Row_ids. Also, the child element of an entry in the Entries 
array of the INDEX table may be a Row_id of a child node 
or a data item in a DATA table. 

In one embodiment of the invention, the size or capacity 
of leaf nodes and/or other nodes of an R-tree index may be 
determined by a parameter of the computing environment in 
which the database is maintained. For example, the capacity 
may correspond to a page size (or other unit of measure) of 
a suitable storage device (e.g., disk drive, memory). 

FIG. 2 is a flowchart illustrating one method of storing an 
R-tree index in a relational database system. In FIG. 2, state 
200 is a start state. 

In state 202/ a new index for a set of multi-dimensional 
data items is received or constructed. The new index may be 
the only index for the data items or may be one of multiple 
indexes, possibly constructed according to a different set or 
order of data dimensions or attributes. 

In state 204 the data items are stored in one or more tables 
in a relational database. The type and configuration of table 
may depend upon the datatype of the data items. 

In state 206 a METADATA table is configured with one or 
more of the fields described above. This table may already 
exist (e.g., if another index is already stored in the database) 
or may be newly generated. In state 208 an entry is stored 
in the METADATA table for the new index. 

In state 210 an INDEX table is constructed to store the 
nodes of the new index and is configured to include one or 



index may be stored in a database using a method discussed 
in the preceding section. One skilled in the art will appre- 
ciate that the methods described in this subsection may be 
modified and other similar methods developed without 

5 exceeding the scope of the invention. 

As will be appreciated by those skilled in the art, the 
present invention may be used to create an index for 
virtually any type of multi-dimensional or multi- attribute 
data. In particular, embodiments of the invention are 

10 intended for use in indexing non-point data as well as point 
data. 

In one embodiment of the invention a VAMSplit algo- 
rithm (extended for polygons by using their centroids) is 
particularly effective for creating an R-tree index from 

15 multi-dimensional data. Other methods of creating an R-tree 
are disadvantageous for various reasons, A hilbert R-tree 
algorithm, for example, linearizes the R-tree data, losing 
spatial proximity in the process, and a tile-recursive algo- 
rithm yields poor clustering when the R-tree data is not 

20 distributed fairly evenly across the multiple dimensions. 
In one effective application of a VAMSplit algorithm, a set 
of N data points is divided whenever N> M (i.e., there are 
too many data points to fit in a single node). In order to 
divide the dataset most equitably, the distribution of data 

25 values within each dimension is computed (e.g., the differ- 
ence between the smallest and greatest values is 
determined). The data is then sorted in the dimension having 
the greatest variance and the sorted dataset is then divided in 
that dimension as close to a median value as possible. 

30 Advantageously, a multi-dimensional data index may 
allow all desired data to be retrieved in a single query. This 
is more efficient than invoking multiple queries against 
uni -dimensional indexes (e.g., and then determining a union 
of the results). 

35 FIG. 3 illustrates one method of dividing a set of two- 
dimensional data points to create an R-tree index according 
to one embodiment of the invention. For the purposes of 
FIG. 3, it may be assumed that the fanout, M, for the target 
R-tree is equal to three. The x- and y-axes may represent any 



more of the fields described above, and possibly other fields 
as well. In particular, and as described below, in one 40 suitable indicia, such as latitude and longitude, color and 



embodiment of the invention an index node may store data 
other than that used to construct the index. For example, in 
an index of multi-attribute sales data each leaf node may 
store a summary of all sales or profits generated by its 
individual data item children. Each node above the leaf node 
level may store aggregations of data items of descendant 
nodes. Thus, each node stores a summary of sales or profits 
commensurate with its level in the index. Such summaries or 
aggregations of non-indexed data may be rapidly reported 
with having to access individual data items. In state 212 an 
entry is stored in the INDEX table for each node of the 
index. The illustrated method then ends with end state 214. 
Manipulating an R-Tree Within a Database Environment 

This section presents illustrative methods for creating and 
manipulating an R-tree index within a database system in 
one embodiment of the invention. One skilled in the art will 
appreciate that other suitable methods may be applied or 
may be derived from those described herein without exceed- 
ing the scope of the present invention. 
Creating an R-TREE Index for Storage in a Database 

Various methods of indexing multi-dimensional or multi- 
attribute data in an R-tree structure are presented in this 
subsection. Following a discussion of how multi- 
dimensional data in general may be indexed, a method of 
indexing multi-attribute data is presented. As explained 
above, multi- attribute data may be considered a special case 
of multi-dimensional data. In either case, the resulting R-tree 



intensity, etc. 

In FIG. 3, dataset 300 comprises multiple data points (a 
first data point is represented by the numeral 302a) that vary 
more in the x dimension than the y dimension. They are 

45 therefore sorted according to their values in the x dimension 
and a first division of the data is made at an approximate 
median, which is illustrated by dividing line 310. 

By computing and applying an approximate median value 
the storage utilization and efficiency for a given R-tree 

50 fanout may be maximized. An approximate median, m(N), 
for N data points may be computed as follows. If the number 
of data points to be divided is less than twice the fanout 
value (i.e., N<2*M), an effective approximate median is 
equal to the floor (i.e., the truncated or rounded down) value 

55 of N divided by two (i.e., m(N)=floor(N/2)). Otherwise (i.e., 
N>2*M), an approximate median may be determined by the 
equation 

m(N)»M *floor(N/(2* M>f0.5> 

60 In FIG. 2, N-ll and, with M-3: 

m(l 1)=3* floor(ll/(2*3)+0.5)=6, 

indicating that an effective location for a first division is after 
the sixth point in the selected (i,e., x) dimension, which is 
65 illustrated by dividing line 310. 

After the first partition, it is determined whether each new 
subset of data will fit into a single R-tree node. Each subset 
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that is too large (i.e., each subset that contains more than M An existing method of indexing multi-attribute data sim- 

data points) is further divided in a similar manner. ply consists of constructing an index on one of the attributes. 

Thus, a first data subset (e.g., that which includes values Such an index, however, is inefficient for responding to a 

in the x dimension less than m(ll)) is further subdivided as query based on a different dimension or a query based on 

follows. This data subset possesses greater variance in the y 5 values in two or more dimensions. For instance, a report or 

dimension than the x dimension, and the approximate a <l ue T ma y be P osed on values io b°th ^me and region 

median (denoted by m(6)) for the subset is computed attributes (i.e., a query mvolves a conjunction of predicates 

accordingly. Because N=2*M, 0Q me time and me re S 10n attributes). Having an index on 

just one of these two attributes does little to facilitate rapid 

m(6)=3*aoor(6/(2*3}+o.5>3 10 processing of the request because the data that satisfies both 

predicates may be scattered within the index. Likewise, 

and the first subset is divided after the third data point in the having separate indexes (e.g., one each for the time and 

y dimension. Dividing line 312 illustrates this division. region attributes) may also be quite slow due to the need for 

The second subset (i.e., the subset having values in the x multiple index searches and merging of results, 

dimension greater than m(ll)) yields an approximate A solution in one embodiment of the invention is to 

median value in the x dimension of 2 (represented on the 15 construct a combined index for multiple attributes of the 

x-axis by m(5)). In particular, N<2*M, therefore data - In particular, an R-tree index may be generated using 

one or more dimensions, where each dimension represents a 

m(5)-floor(5#)-2 multi-tiered or hierarchically structured data attribute. In 

other words, a region dimension may comprise attributes or 

and the second subset is divided after the second data point. 20 sub-attributes that have a hierarchical arrangement, such as 

This division is represented by dividing line 314. It will be State and City (i.e., every city is part of a state), 

apparent that each of the approximate median values com- In generating such an index, data items may be clustered 

puted above is relative to the subject dataset or data subset. in a manner that facilitates relatively rapid responses to a 

Thus, the representations of m(ll), m(6) and m(5) on the x- query that matches a set of expected query patterns, 

and y-axes should be considered relative to the origin or the 25 Illustratively, a query pattern or a set of query patterns 

preceding approximate median value, as appropriate. specifies a set of dimensions (or hierarchically structured 

After the second subset is divided, each subdivision, or attributes) and an expected selectivity for the query predi- 

cluster, of data points can now fit into a node of an R-tree cates on those dimensions or attribute hierarchies. By way of 

having a fanout value of three; therefore no further division illustration, in the sales example above queries may always 

is required. Ousters 320, 322, 324, 326 in FIG. 3 are 30 (or always be expected to) retrieve data in terms of years and 

demarcated by dotted boundaries. Now that the data is cities. In this illustration two dimensions/attribute hierar- 

clustered appropriately, the R-tree index may be constructed chies are expressed — the first being time, for which the 

by placing each cluster of data items into a separate leaf selectivity is yearly, and the second being region (consisting 

node (e.g., by placing identifiers of each data item and its of state and city attributes), for which the selectivity is city, 

storage location or address in the appropriate leaf node). 35 Thus, if a user is likely to retrieve data based on one or 

From the leaf nodes, parent nodes may be formed such that more particular attributes, and/or with a particular range of 

each entry in a parent node comprises an MBA of a cluster values within an attribute, the R-tree can be constructed in 

of data and an identifier of (e.g., a pointer to) the corre- such a manner as to respond to these queries in a very 

sponding leaf node. In similar fashion, grandparent nodes of efficient and responsive manner. The possible data values for 

the leaf nodes, and higher nodes as necessary, may be 40 each attribute or dimension may be discrete in nature (e.g., 

formed. Eventually a root node of the R-tree is constructed. specific dates or cities) or may be continuous (e.g., latitude 

FIG. 4 depicts R-tree 400, one possible result of indexing and longitude). In different embodiments of the invention a 

dataset 300 of FIG. 3. In FIG. 4, leaf nodes 420-426 are user may design or specify a query pattern, select one from 

created for clusters 320-326. Because the number of leaf a number of options, or a default pattern may be assumed, 

nodes (i.e., four) is greater than M (i.e., three), intermediate 45 In one embodiment of the invention multiple R-tree 

nodes 412, 414 are created to index the leaf nodes. Root indexes may be created from a given set or subset of 

node 402 is then constructed, with entries for nodes 412, multi-attribute data. Thus, if users employ multiple query 

414. R-tree 400 may be stored in a database by a method patterns for the data a different index may be used to respond 

described above. to different queries. The different indexes may reflect dif- 

The same approach may be extended to cluster non-point 50 ferent orderings of the data dimensions, different scopes or 

datasets consisting of polygons and other spatial data. hierarchies within a particular attribute, etc. A query ana- 

Illustratively, for such non-point data the centroids of the lyzer or similar tool may be employed to determine which of 

data items are used to divide the dataset and cluster data multiple established query patterns a user's actual query 

items into leaf nodes. However, an MBA of a leaf node may most closely matches. In addition, one or more query 

reflect the full range of data items in the node. The remainder 55 patterns may be "learned** from a user's activity and a 

of the tree may then be constructed as described above. suitable index may then be constructed. 

As described previously, "multi-dimensional data" can In storing multi-attribute data in a database, each dimen- 

include data that is inherently multi-dimensional (e.g., geo- sion or attribute hierarchy may be stored in any of a number 

graphical data) and can also include data having multiple of different forms. For example, each dimension or attribute 

attributes, such as sales data having characteristics including 60 tier may be stored as a separate object, as a separate row in 

time of sale, region of sale, product sold, etc. In one a table, its values may be stored as separate columns in a 

embodiment of the invention multi-attribute data may be table (e.g., one column for each year, each quarter of a year, 

indexed in a manner different from inherently multi- each month), etc. The range of values of a dimension may 

dimensional data. Because of the similar manners in which also be stored as a hierarchy, such as year, quarter, month, 

these types of data may be stored, however, the following 65 and day for a time dimension. 

discussion may use the terms attribute and dimension inter- One algorithm for generating an R-tree index to store 

changeably. multi-attribute data in a database is provided below. In this 
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algorithm data is separated or clustered on the basis of query 
retrieval units. Illustratively, a query retrieval unit represents 
the minimal scope or granularity (i.e., selectivity) with 
which data is indexed in a particular dimension, which scope 
is specified in the operative query pattern. For example, if 
the specified query pattern retrieves data for three different 
products (i.e., the product dimension has three values), then 
there are considered to be three query retrieval units in the 
product dimension. As another example, consider a time 
dimension. If the sales data reflects sales over a two-year 
period, we could define our query retrieval units to be years, 
quarters, months, etc., depending on the specified query 
pattern. If we choose quarters, then there would be eight 
query retrieval units; if we choose months, then there are 
twenty-four. 

An illustrative algorithm for clustering multi- attribute 
data items for leaf nodes of an R-tree index is now provided: 

1. Store all data items in a single node if possible (i.e., 
depending upon the fanout of the index) 

2. Otherwise, until each subset or cluster of data items will 
fit into a single node, do: 

2.1 Select a dimension in which to divide an over- 
populated cluster or subset of data items: 

2.1.1 Compute the number of query retrieval units in 
each dimension: 

2.1.1.1 For attributes having discrete values, deter- 
mine the number of possible values (e.g., number 
of cities for a region dimension or brand names for 
a product dimension) 

2.1.1.2 For attributes having a continuous range of 
values (e.g., latitude), a query pattern may specify 
a percentage, P, of the values in that domain that 
may be accessed in a particular query; the number 
of query retrieval units is then equal to (100/P)* 
(range of subset/range of entire set) 

2.1.1.3 Some attributes (e.g., time) may be repre- 
sented by discrete values (e.g., days, months, 
quarters) or a continuous range; the number of 
query retrieval units is measured accordingly 

2.1.2 The attribute that has the most query retrieval 
units is selected as the dimension in which to divide 
the data items 

2.2 Sort the data items in the selected dimension 

2.3 Divide the data items, possibly to yield an 
(approximately) equal number of data items in each 
subset or, as one alternative, in a manner that yields a 
number of data subsets one or more of which will fit 
into individual leaf nodes 

3. Repeat steps 1 and 2 until each subset fits into a single 
node 

It should be noted that for attributes having values that fall 
within a continuous range (e.g., latitude, longitude, possibly 
time) instead of being expressed as discrete values, a query 
pattern selected by a user or adopted for the construction of 
an index may specify a likely percentage of the range of 
attribute values. For example, if values for a latitude dimen- 
sion have a total range of ten degrees, a query pattern may 
specify that a query (e.g., a query window) is expected to be 
only one-half of one degree wide. Thus, five percent of the 
dimension's range may be invoked during each query and 
the resulting number of query retrieval units in this example 
is twenty. 

At the time that data items are to be divided in a selected 
dimension, different criteria may apply for selecting an 
optimal point of division. In general, when the dimension in 
which the data items are divided is marked by discrete 
values, the point at which the data items are divided (e.g., in 
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Step 2.3 above) is selected to ensure a clear demarcation 
between items in one subset and those in another subset. 
When the dividing dimension is marked by continuous 
values, however, the selected dividing point is intended to 

5 minimize the overlap of the MBAs in the resulting index. In 
one embodiment of the invention data items are divided by 
calculating an approximate median as described earlier in 
this subsection. The calculated approximate median may be 
adjusted somewhat in order to achieve a more efficient 

1Q ordering of the data items. In addition, when query retrieval 
units comprise intervals or ranges rather than discrete 
values, the center of the intervals may be used to calculate 
variance and for other purposes. 

After leaf nodes of the index are constructed for the data 
items, nodes at the next higher level of the index may be 

15 generated based on the leaf nodes. In like fashion, succes- 
sively higher levels of the index may be populated until a 
root node is put in place. 

Multiple indexes can be constructed by choosing different 
selectivity values for a dimension or attribute hierarchy. For 

20 example, one index may be constructed to cluster on year 
and city granularities/selectivities (e.g., for time and region 
dimensions) as in an earlier example. Another index may be 
constructed using year and state granularities. The number 
of possible index structures increases as the hierarchy (e.g., 

25 number of tiers) in each dimension or attribute increases. 
Hius, if a county attribute is added to the region dimension 
then query patterns could be designed accordingly, which 
may affect the number of query retrieval units and the 
manner in which the dataset is divided. As one skilled in the 

3Q art will appreciate, multiple indexes are useful in parallel 
evaluation of expensive OLAP (OnLine Analytical 
Processing) operations, such as CUBE in warehousing 
applications. 

In one embodiment of the invention an index of multi- 
dimensional or multi-attribute data may also store informa- 

35 tion derived from data items, in addition to storing infor- 
mation concerning the data dimensions or attributes. For 
example, nodes of an R-tree index for sales data may store 
sales or profit data in addition to values for searchable 
attributes such as state, city, year, product, etc. Searchable 

40 attributes refer to those attributes or dimensions that may be 
specified as part of a search or query — such as in a WHERE 
clause of a SELECT statement in SQL (Structured Query 
Language). 

The sales or profit data stored in index nodes in this 

45 example may be aggregated. In other words, a leaf node may 
store the combined sales or profit figures for all of its data 
items. The next node above that leaf node may store the 
aggregated sales or profit figures for all of its leaf node 
children, and so on. In this embodiment, only the searchable 

50 attributes (e.g., state, city, year, product) are used to form the 
index (i.e. to divide and cluster the data items, form MBAs). 
Storing the secondary data (e.g., sales, profits) in index 
nodes allows the rapid generation of summary reports with- 
out having to access a DATA table or individual data items. 

55 FIG. 5 is a flowchart illustrating one method of construct- 
ing a hierarchical index f or a set of multi -dimensional data. 
In FI(J~5, state 50(fisTstart state. 

In state 502, a set of multi-dimensional or multi-attribute 
data items is selected for indexing. In state 504 the number 

60 of data items (e.g., point data items, polygons, etc.) in the 
dataspace is counted or otherwise determined. 

If, in state 506, it is determined that all of the data items 
will fit into one node (i.e., the total number of data items is 
no greater than the node capacity of the index), the data 

65 items are placed (e.g., associated with) one leaf node in state 
522, after which the illustrated procedure continues at state 
524. 
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Otherwise, if the number of data items in the dataspace is 
greater than the node capacity of the index, in state 508 the 
variance within each dimension or attribute (or attribute 
hierarchy) is determined. 

In state 510 the dimension or attribute hierarchy having 5 
the greatest variance is selected and, in state 512, the data 
items are sorted in the selected dimension. 

In state 514 the data items are divided in the selected 
dimension into two or more subsets. An approximate median 
may be computed as described above in order to divide the 
data items in half as nearly as possible. Alternatively, the 
data items may be divided into a number of clusters, each of 
which contains a number of data items that will fit into one 
leaf node. 

In state 516 the number of data items in each subset is 
calculated to determine if any of the subset need to be further 1 5 
subdivided to yield clusters that will fit into leaf nodes. 

If, in state 518, it is determined that none of the subsets 
have a greater number of data items than the node capacity 
of the index, the data items in each cluster/subset are placed 
in a separate leaf node in state 522 and the illustrated 20 
procedure proceeds to state 524. 

Otherwise, if a subset has too many data items, one such 
subset is selected in state 520. States 508-520 are then 
performed repeatedly, as necessary. In this manner, the set of 
all data items is divided and subdivided as necessary to yield 25 
clusters of data items small enough to fit into individual leaf 
nodes. 

In state 522, each data item is stored or associated with a 
leaf node. Illustratively, an entry is made in the leaf node, to 
contain an identifier of the data item (e.g., by Row_id) and 30 
a suitable MBA, which will closely fit the data item in one 
embodiment of the invention. 

In state 524, the higher levels of nodes in the index (e.g., 
levels 2 and up) are configured. Illustratively, a node is 
configured at level 2 for every L leaf nodes, where L^M. In 35 
particular, nodes at level 2 and above may be formed so as 
to minimize the total area or volume associated with the 
node. Each newly configured node receives an entry for each 
child node, consisting of an identifier (e.g., Node_id, Row_ 
id) and an MBA that encompasses the MBAs of each entry 40 
in the child node. State 524 finishes with the configuration 
of a root node. The illustrated procedure then ends at end 
state 526. 

Inserting or Deleting a Data Item 

Update operations (e.g., insertion or deletion of data) 45 
traverse an R-tree as necessary to carry out the addition or 
removal of a data item and ensure that the R-tree index is 
updated as necessary to reflect the modification. This sub- 
section presents illustrative methods of inserting and delet- 
ing a data item in an R-tree index that is or is to be stored 50 
in a database. The data item may be any type of multi- 
dimensional (including multi-attribute) data. 

When adding a data item in a present embodiment of the 
invention, an R-tree index is traversed from the root to find 
an appropriate leaf node in which to insert the data item. 55 
Illustratively, the choice of which branch or subtree to 
follow from a given node is made based upon which child 
node would require the smallest increase in its MBA if the 
new data item were added. However, if the node at the root 
of a selected subtree has as children leaf nodes whose MBAs 60 
overlap, then the child that is chosen to receive the data item 
is the one whose expanded MBA (i.e., to receive the new 
data item) results in the least overlap (if any). If, perchance, 
the new data item could be added to more than one leaf node 
and result in the same or nearly the same amount of overlap, 65 
then the leaf node whose MBA increases in size the least is 
chosen. 



,605 Bl 

16 

When a leaf node chosen to receive a new data item would 
exceed its capacity (i.e., M, the R-tree fanout), the node is 
split into two nodes by dividing the entries (i.e., the existing 
entries plus the new item). In one embodiment, the data 
items are divided in such a manner that the overlap between 
the nodes* MBAs and the size of their respective MBAs are 
minimized. A method similar to that described above for the 
initial partitioning of R-tree data may be applied to divide 
the nodes. 

When a leaf node splits, ancestors of the affected node 
may need to be updated as well (e.g., to adjust MBAs). Thus, 
the split propagates upward until the old and new leaf nodes 
are correctly connected to the root and the MBAs of inter- 
mediate nodes are adjusted as necessary to reflect the change 
in the index structure and composition. 

A suitable algorithm for inserting a new data item in a leaf 
node of an R-tree T, as an entry in the form identifier, 
MBA> follows (where identifier may be a Node__id, Row_ 
id or other identifier of the data item, and MBA is a suitable 
bounding area encompassing the data item): 
[Step 1 — Locate a leaf node for inserting the new data item] 

1.1 Set insertion-node to the root of T 

1.2 Initialize insertion-path (e.g., a stack) to record the path 
of this update operation 

1.3 While insertion-node is not a leaf node, do: 

1.3.1 Compare entries in insertion -node; select the entry 
that would cause the least increase in its MBA area to 
add the data item (or the least increase in MBA overlap 
if insertion-node's children are leaf nodes) 

1.3.2 Push insertion-node onto insertion-path 

1.3.3 Set insertion-node to the node associated with the 
entry selected in Step 1.3.1 

1.3.4 Push insertion-node onto insertion-path 

[Step 2 — Insert data item in leaf node and update nodes on 
insertion-path] 

2.1 Initialize insertion -entry to the new data item 

2.2 Initialize modified-entry to nil (i.e., empty) 

2.3 While insertion-path is not empty, do: 

2.3.1 Pop the top node in insertion-path, call it nodes 

2.3.2 Set old-MBA to the MBA of nodes 

2.3.3 If modified-entry is not empty, then: 

2.3.3.1 Identify the entry in node,- whose child reference 
(e.g., pointer, Row__id, Node__id) matches that of 
modified-entry 

2.3.3.2 Replace the MBA of that entry in node t - with the 
MBA of modified-entry 

2.3.4 If insertion -entry is not nil, then (insert it in node,-): 

2.3.4.1 If node,- is not full (e.g., has less than M entries): 

23.4.1.1 Add insertion-entry to node t 

23.4.1.2 Increment child counter of node, 
23.4.13 Set insertion-entry to nil (i.e., empty) 

2.3.4.2 If node,- is full: 

23.4.2.1 Create new node, call it node^- 

23.4.2.2 Divide the entries in node,, plus insertion- 
entry, into two subsets; store one in node,- and the 
other in node rt 

2.3.4.23 Set insertion-entry to the entry <node#, 
MBA of node^ 

2.3.5 Set new-MBA to the new (e.g., recalculated MBA) 
of nodes 

2.3.6 Compare old-MBA and new-MBA to determine if 
the MBA of node,- changed during the update. If so, set 
modified-entry to <node ( ., new MBA of node,->; 
otherwise, set it to nil (i.e., empty) 

23.7 If modified-entry and insertion-entry are nil, end the 
algorithm (i.e the update does not propagate to the 
parent of insertion-entry) 
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2.4 If insertion-entry and modified-entry are not nil (i.e., the decreased by deleting a relatively small number of entries, 
root node of T has split), create a new root node with If the quantity or ratio of empty and inefficient nodes meets 
entries for these two entries and update the METADATA a certain criteria (e.g., a range of approximately 5% to 
table accordingly approximately 10% of the total number of index nodes) then 
Although this algorithm is configured to add a new data 5 the tree is restructured. In an illustrative restructuring opera- 
item to an R-tree index, one skilled in the art will appreciate tion empty nodes may be deleted and entries may be deleted 
how it may be modified to produce a suitable algorithm for from under-filled or inefficient nodes and re-inserted into the 
deleting a data item. index. In one implementation of this embodiment other 
FIG. 6A depicts dataset 600, with new data item 602 that update operations on the R-tree are suspended during the 
is to be added to an R-tree by applying the algorithm 10 reorganization function. Giving only the reorganization 
provided above. FIG. 6B depicts R-tree 650 resulting from function the ability to update the R-tree helps prevent the 
the data insertion operation. The process of adding data item index from being corrupted during the operation. 
602 may be described as follows. In a previous method of updating a particular type of 
The location phase of the algorithm commences at the R-tree known as an R*-tree, a data item could be forcibly 
root. At root node 652, the entry for node 664 is chosen for 15 inserted into a node (e.g., if the node is already full). And, 
insertion because the MBA of the subtree rooted at node 664 when a data item was deleted from an R*-tree, its node may 
requires the least expansion to include data item 602. At be deleted (if empty) or merged with another. As discussed 
node 664, it is determined that the MBA for leaf node 674 above, however, a method of updating an R-tree in one 
(comprising the data items of cluster 624 before the embodiment of the present invention splits nodes instead of 
insertion) would require less expansion than the MBA of 20 forcibly inserting entries and postpones treatment of empty 
leaf node 676 (comprising the data items of cluster 626). or under-filled nodes. In addition, the previous method of 
Thus, at the end of the location phase, the insertion path updating an R*-tree was designed to work with a file-based 
consists of node 674, node 664, and root node 652. index, not an index that has been stored in a database system. 

It should be apparent that adding an entry for data item An R-tree suitable for storage in a database management 
602 to leaf node 674 in the insertion phase of the algorithm 25 system in one embodiment of the invention may support a 
will cause node 674 to split because it would contain more . number of other operations, such as a window query, a 
than M (i.e., three) entries. Thus, node 678 is created to nearest-neighbor query or an intersection join. These opera- 
accept the extra data. The data items of node 674 plus item tions are discussed in the following subsections. 
602 are then divided among nodes 674, 678. In particular, if In one embodiment of the invention, query operations 
the partitioning method described above is applied the data 30 return or identify data items whose MBAs satisfy the 
items are split along the x-axis. Cluster 624 is thus replaced specified query parameters). The degree of correlation 
by clusters 624a, 624b corresponding to nodes 674, 678, between the query parameters) and the data that is returned 
respectively. may depend upon the type of data involved. For example, for 
In the second iteration of the insertion phase of the most data that is not inherently multi-dimensional (e.g., 
algorithm, node 664 is visited. The MBAof the entry in node 35 from data warehousing applications, multi-media 
664 for leaf node 674 is modified to reflect the change in data applications, etc.), the data identified in response to a query 
item distribution. Also, an entry for node 678 is added to will likely be quite accurate even though the data items were 
node 664. And, because the MBAof the entire subtree rooted located on the basis of their MBAs rather than their specific 
at node 664 did not change, the algorithm ends. The end values. For applications such as GIS, however, the use of 
result is depicted in FIG. 6B. 40 MBAs may limit the precision with which identified data 
The data items in the individual leaf nodes are not items match a user's query. For these applications some 
depicted in either FIG. 4 or FIG. 6B. As described above, in post-query processing may be applied to determine the most 
one embodiment of the invention data items are maintained appropriate results for a user in a given application or 
in a separate DATA table, in which case each item is domain, 
identified in a leaf node by its Row_id or some other 45 Window Query 

suitable identifier. One difference in the leaf nodes between One type of query that is supported in one embodiment of 

FIG. 4 and FIG. 6B is that in FIG. 6B leaf node 674 has been the invention is a window query. Illustratively, in a window 

modified to include only the data items in new cluster 624a query a user specifies a window (e.g., a set of ranges or 

(vice original cluster 624) and leaf node 678 has been added points in each dimension, a rectangle drawn on a graphical 

for cluster 624b. 50 interface) and data having a specified relationship to the 

The processing of a deletion operation proceeds in a window is retrieved or identified. The specified relationship 

manner analogous to the insertion of a new data item. In is provided as a selection criterion of the user. Illustrative 

particular, a deletion operation commences at the root node selection criteria include intersection, containment, enclo- 

and propagates downward through all nodes and subtrees sure and match. A window may be of virtually any size or 

that intersect an MBA containing the data item to be 55 shape. 

removed. Eventually, the leaf node containing the target data Illustratively, a window query submitted with a criterion 
item is identified and the data item is deleted. of "intersection" (i.e., an "intersection query" retrieves data 
In one embodiment of the invention a leaf node is not items having MBAs that intersect the specified query win- 
deleted even when the last data item within the node is dow. Similarly, containment queries identify data items 
removed. The MBAs of all ancestor nodes are, however, 60 whose MBAs are completely contained within the query 
updated as necessary. In this embodiment a reorganization window. Conversely, enclosure queries retrieve data items 
function is provided to restructure an R-tree for more whose MBAs completely enclose the query window. Finally, 
efficient operation. Illustratively, the reorganization function match queries identify data items whose MBAs match the 
identifies empty or under-filled nodes (e.g., nodes with less query window (e.g., exactly or within a specifiable range of 
than 40% occupancy) and/or nodes with inefficient structure. 65 similarity). 

A node may be considered to possess an inefficient structure For a given query window and set of data items or 

if, for example, the area of its MBA can be substantially datasets, in one embodiment of the invention any data item 
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that satisfies a match query will also satisfy an intersection, included in the set of candidate neighbors as long as the set 

containment and enclosure query. Further, any data item that contains less than k neighbors. Otherwise, if the distance 

satisfies an enclosure or containment query will also satisfy from the query to the data item is less than the distance from 

an intersection query. me query to any other data items in the set, the data item 

To answer or respond to a window query in one embodi- 5 replaces the candidate neighbor farthest from the query, 

ment of the invention, a search for target data items com- ^ P rocess » repeated by removing entries from the 

mences at the root and propagates toward the leaf nodes in 1 ueue until the farthest neighbor in the set of candidate 

a recursive manner. Starting with the root, at each non-leaf neighbors * cl u oser to < he W than any entries m the .queue, 

node visited in response to the query it is determined which At th !» une - the »t °f candidate neighbors is returned as the 

child nodes should be examined, depending upon the speci- 10 xt ? f k " aea f^ J° the query. 

„ , . . . „ . ' . • * T- i ^ °ne skilled in the art will appreciate, the presently 

fiedentonon.^ described method of findi { £\ nearest . n ; ighbors 

child nodes may be considered at once; otherwise, only one accesseg a minimmn mmh&T * Qodes ^ & iveQ mdex A 

is invesugated at a time and other child nodes that should be pre vious algorithm, which was developed in the context of 

visited are marked for later examination. A stack, list or quadtrces but ^ ako applicable to R-trees, requires space on 

other data structure may be maintained during the process- 15 me onler c f Q(n) for the priority queue, where n is the 

ing of a query to store nodes that are to be examined. number of multi-dimensional data items. Advantageously, 

Illustratively, if the selection criterion is either intersec- the present technique requires space for the priority queue 

tion or containment, query processing propagates to those on the order of 6(number of nodes) and space for the 

child nodes/subtrees whose MBAs intersect the query win- candidate neighbors on the order of 0(k). Thus, because the 

dow. If the selection criterion is either enclosure or match, 20 number of nodes in an R-tree index in an embodiment of the 

the query propagates to those child nodes/subtrees whose invention can be computed by dividing the number of data 

MBAs enclose the query window. items by the index's node capacity, M, the total space 

At the leaf level, each data item in a visited leaf node is required is 0(k+n/M). As a result, this technique for finding 

examined to determine if it satisfies the applicable selection a set of nearest-neighbors can be performed with limited 

criterion. Thus, while similar tests may be applied at non- 25 memory resources and is highly scalable, 

leaf nodes for intersection and containment criteria and for This method of finding a nearest neighbor may be illus- 

enclosure and match criteria, at leaf nodes the query pro- trated with reference to FIGS. 7A-7B. In FIG. 7 A, dataset 

cessing may differ for each criterion. 600 of FIG. 6A is used for purposes of illustration. The 

^ Nearest-Neighbor Query R-tree of FIG. 7B is one indexed representation of this 

Another type of query that may be implemented in an 30 dataset. 

embodiment of the invention is a nearest-neighbor query. The query point for the illustrated nearest neighbor query 

The purpose of this type of query is to locate the nearest data in FIG. 7A is point 702. The MBAs of the subtrees rooted 

item(s) to a specified point or data value. Based on the target at node 762 and node 764 are depicted by boundaries 712, 

value or point specified by a user and the number of 714, respectively. Similarly, boundaries 720-726 represent 

neighbors requested, an R-tree is searched and the appro- 35 the MBAs of leaf nodes 770-776. For the sake of brevity, it 

priate data items returned to the user. For example, in a will be assumed that only the closest data item to the query 

geographical application a user may specify a place of point is desired (i.e., k=l). 

interest in order to find a desired number of the closest Application of the subject method for finding a nearest 

neighboring points. neighbor begins at root node 752. Accordingly, entries for 

"~ One method of responding to a nearest-neighbor query is 40 the two child nodes of the root (762, 764) are placed on the 

as follows. Apriority queue (which in an alternative embodi- working queue: <762, distance(702, 762)> and <764, 

ment may be a list, stack, or other data structure) for tracking distance(702, 764)>. 

subtrees and/or data items from the R-tree is maintained; the Because MBA 712 of node 762, is closer to query point 

queue is initially empty. Starting with the root node, child 702 than MBA 714 of node 764, the entry <762, distance 

nodes and data items are inserted and removed from the 45 (702, 7 62) > is removed from the queue. In its place are 

queue as the query propagates through the index. A set of k inserted entries for the leaf node children of node 762:<770, 

candidate neighbors is also maintained (e.g., as array entries distance(702, 770)>and <, distance(702,772)>. 

ordered by their distance from the query), where the value The queue entries are again examined and it is determined 

of k may be supplied by the user or may be a default value. that query point 702 is closer to MBA 714 than MBA 720 (of 

First, an entry is added to the queue for each child of the 50 node 770) or MBA 722 (of node 772). Thus, the entry <764, 

root node. Illustratively, entries in the queue are of the form distance(702, 764)> is removed from the queue and replaced 

<item, distance;*, where item may be a node or a data item, with entries for its leaf node children: <774, distance(702, 

depending on the height of the current node within the tree. 774)> and <776, distance(702, 776)>. 

Where the item is a node, distance is the distance from the Of all the queue entries, the closest MBA to the query 

user's point to the closest edge or perimeter of the node's 55 point is now MBA 724 of leaf node 774. Thus, the entry 

MBA. Where the item is a data item, distance is the distance <774, distance(702, 774)> is removed from the queue and 

from the user's specified point to the data item or the closest replaced with entries for each of its data items, including 

boundary of an MBA of the data item. data item 704. 

The entry in the queue having the shortest distance Because the distance between data item 704 and query 

parameter is selected for further processing. Thus, a priority 60 point 702 is the shortest distance in the queue, it is returned 

queue in which entries are ordered on the basis of the as the nearest neighbor for the query point, 

distance from the entry to the query may be best suited for Another method of satisfying a nearest-neighbor query 

this purpose. If the selected entry is another node (e.g., a processes the R-tree in a depth-first manner, through the 

child node of the root), the entry is removed from the queue child subtree that is closest to the user's point of interest. In 

and replaced by entries corresponding to the child entries of 65 this process a queue or list may be maintained to track the 

that node (which may be child nodes or data items). If the subtrees that are circumvented as a path is traversed, which 

selected entry is a data item, however, the data item is subtrees may be visited after reaching the bottom of the path. 
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Yet another method of responding. to a nearest-neighbor that a node that split will now have a different Node_jd than 

query combines the previous two approaches. In particular, was stored during Step 1. By tracing the Sibling field (e.g., 

this approach begins at the root in a depth-first propagation a pointer) of the altered node, recursively if necessary, the 

pattern as in the latter of the two previous methods. Pro- new sibling node can be found and updated as necessary 

cessing continues in a depth-first manner until a subtree is 5 ( e -g*> to slore me correct MBAs for its entries). In a like 

reached that will fit entirely within available memory (e.g., manner, the Parent_row_id field allows the detection of 

cache, RAM). That subtree is then processed using the root node s P lits * 

former method. Thus, this method serves to identify a level ° De embodiment of the invention may be configured for 

within the R-tree at which subtrees may be processed in ™ I* c ? Q ™ blc md , exmg f ramework of Oracle 

memory. Above that level (i.e., closer to the root) nodes are to ^^J". framework > read °f™*ous that are 

j J \ c * u i ! ■ i j*# * .u > conducted during an index query cannot read inconsistent 

ordered first by level then by distance to me user s query dftU J such ^ J rformed Qn ^ $ame 

point. Below the critical level, nodes are ordered using their committed snapshot of the databaj £ within this extensible 

distances to the query point. framework, however, index updates may read the latest 

Intersection Join committed versions of index nodes. In order to resolve any 

Yet another operation that may be performed with R-trees is inconsistencies that may occur during the updates, a tech- 

in an embodiment of the invention is a join operation. In a mque sucn ^ mat described above may be used to detect 

typical join operation, datasets indexed in separate R-trees node splits. A split of a root node, however, may be detected 

are compared to identify areas of the datasets that overlap. by tracing a pointer to a parent node (e.g., the Parent_ 

In one method of performing a join operation on datasets node_id field of a node) from the erstwhile root node. Thus, 

R and S, processing begins at the root nodes of the corre- 20 a root node is not "anchored" once it is created, 

sponding R-tree indexes. A pair of nodes, one from each Buffering During R-Tree Operations 

index, is processed at a time. In processing a pair of nodes, In one embodiment of the invention, nodes of an R-tree 

intersecting pairs of entries of the two nodes (i.e., where the index stored in a database may be buffered during manipu- 

MBAof one entry from one node intersects the MBA of an lation of the index (e.g., addition or deletion of a data item, 

entry from the other node) are identified. If the entries 25 index reorganization). In particular, if the index is too large 

correspond to data items the entry pair is included as part of to fit into available memory space, one or more nodes of the 

the join result If only one of them is a data item, it is used index may be buffered in accordance with a method 

as a window query (with a selection criterion of intersection) described in this section. 

on the subtree corresponding to the other entry. A queue (or As one skilled in the art will appreciate, the performance 

stack, list, etc.) may be maintained to store entry pairs as 30 of an R-tree index may be measured by such factors as the 

they are identified and until they are further processed. number of R-tree nodes that are processed or accessed in 

Concurrency Control response to a query and/or the time needed to respond to a 

In one embodiment of the invention, in which concurrent query. By buffering a portion of the index (e.g., one or more 

update operations may be performed on the database in nodes) when the entire index cannot fit into memory, the 

which an R-tree index is stored, one or more measures may 35 response time can be greatly improved. In particular, by 

be taken to prevent corruption of the index. maintaining index nodes in memory access to them is much 

For example, some operations — such as the initial faster than if they remained on disk or another relatively 

construction/storage and/or reorganization of the tree — may slow storage device. 

be performed in an exclusive mode. In an exclusive mode of The probability of accessing a particular node of the index 

operation, no other update operations may be performed, 40 during a query or other operation may depend upon whether 

thus eliminating the risk of corrupting the index. the operations are expected to be evenly distributed across 

Another measure that has already been introduced is the dataspace (e.g., the set of all multi-dimensional data 
postponing the deletion of empty nodes until an index is items indexed in the R-tree) or concentrated in one or more 
reorganized. This measure in large part limits the danger of regions of the dataspace. In addition to this factor, a buff- 
distorting the index to insertion operations when a node is 45 ering technique practiced in an embodiment of the invention 
split. Therefore, in one embodiment of the invention the may also consider that a parent node must be accessed 
following additional safeguards are taken. before its children. Even further, the multi-dimensional 

First, whenever a node, call it node 1, splits into two nature of the data items must be taken into account (e.g., a 

nodes, call them node 1 and node 1' the Node_Jd of node 1 particular data item may be retrieved on the basis of its 

before the split is assigned to node 1*. Node 1 then receives 50 value(s) for one or more dimensions or attributes), 

a new Node_id generated by a Node__id generator. Second, When the impact of queries or other operations (e.g., the 

when the tree is being traversed during an update operation number or distribution of which nodes are accessed) is 

(e.g., addition or deletion of a data item), the node Node_id expected to be fairly uniform over a given dataspace, the 

is also saved for each node that is added to the update path probability of accessing a particular node should be propor- 

(e.g., insertion-path from the insertion algorithm provided 55 tional to the space it encompasses. Therefore, in this situa- 

above). Third, when a root node such as nodeR splits into tion nodes may be buffered on the basis of their size, which 

nodeR and nodeR', a new root, nodeS, is created with entries may be measured by the amount of dataspace encompassed 

for nodeR and nodeR'. The leftmost node of nodeR and by its MBA (i.e., the MBA of the node's entry in its parent 

nodeR* sets its Parent_row_id field to the Row_id of node) or the MBAs of all its descendant data items. 

nodeS and its Parent_node_id to the Node_id of nodeS. 60 Illustratively, a root node's size, since it has no parent, 

As a result of these safeguards, if a node splits during an corresponds to the entire dataspace. Thus it will be seen that 

update operation the split is detected and the integrity of the when buffering in an environment of uniform query distri- 

index is maintained. For example, in the insertion algorithm bution the R-tree *s root node will be the first node buffered, 

described above, each time a node is added to the insertion- which is logical since many operations on an index begin at 

path in Step 1, its Node_id value is also saved. Later, when 65 the root. 

nodes in the insertion-path are updated in Step 2 to account In addition to the root node, a successive number of nodes 

for MBA alterations and/or split propagations, it will be seen may be buffered — based on their sizes — until the buffer can 
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hold no more nodes. Typically, this method will result in the 
upper portions of the R-tree being buffered. If, however, the 
next node to be buffered is too large for the buffer, in order 
to make the best use of the buffer space a smaller node may 
be stored. 5 

This method of buffering nodes for uniform query distri- 
butions may be extended to make it suitable for situations in 
which the distribution of queries is concentrated in one or 
more regions of the dataspace. In particular, in one embodi- 
ment of the invention statistics are collected concerning 
access to individual nodes, data items or clusters of data 
items. Maintaining such statistics facilitates the determina- 
tion of which nodes or regions are accessed most frequently. 

Then, based on the collected access statistics, the most 
frequently accessed nodes may be buffered first. Because 
one particular data item or set of data items may be accessed 15 
in a large proportion of queries or other buffered operations, 
the nodes that are buffered from the index may form a path 
toward those items. When two or more nodes have been 
accessed with equal or near equal frequency and one must be 
chosen for buffering, the determination of which node to 
buffer may be made on the basis of which node is associated 
with a larger MBA or has more data items in its subtree. 

The composition of a buffer may therefore change over 
time. If the focus of users* queries changes from one portion 
of an index to another, the buffered nodes may gradually 
change from one set to another set. Likewise if query 
operations change from being relatively uniform in distri- 
bution to being concentrated in a region. 

In one embodiment of the invention the frequency with 
which a particular node is accessed is tracked by one or more 
counters. For example, a separate counter may be associated 
with each node of the R-tree, each cluster of data items, etc. 
The counters may be initialized or re-initialized when the 
database is started, when the index is re-organized or 
updated, at specified or programmable intervals, etc. 

The foregoing descriptions of embodiments of the inven- 
tion have been presented for purposes of illustration and 
description only. They are not intended to be exhaustive or 
to limit the invention to the forms disclosed. Many modi- 
fications and variations will be apparent to practitioners 
skilled in the art. Accordingly, the above disclosure is not 
intended to limit the invention; the scope of the invention is 
defined by the appended claims. 

What is claimed is: 

1. A computer readable storage medium storing instruc- 45 
tions that, when executed by a computer, cause the computer 
to perform a method of constructing a hierarchical index 
from a set of multi-dimensional data, the method compris- 
ing: 

(a) calculating a number of members of a set of multi- 
dimensional data items; 

(b) determining whether said number of members exceeds 
a node capacity of a hierarchical index configured to 
index said data items; 55 

(c) determining a variance of values in one or more 
dimensions of said data items; 

(d) identifying a first dimension of said one or more multi 
dimensions in which to divide said set of multi- 
dimensional data items; 50 

(e) sorting said data items in said first dimension; 

(f) dividing said sorted data items in said first dimension 
into two or more subsets by: 

(f) determining whether said number of members is 
less than double said node capacity; and 65 

(f ) calculating the Boor of the quotient of said node 
capacity divided by two; 
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(g) repeating (a)-(f) for each said subset in order to divide 
said set of data items into a plurality of data item 
clusters, wherein each cluster comprises a number of 
data items no greater than said node capacity; and 

(h) configuring a leaf node of said hierarchical index for 
data items in a first cluster. 

2. A computer-implemented method of constructing a 
hierarchical index from a set of multi-dimensional data, 
comprising: 

(a) calculating a number of members of a set of multi- 
dimensional data items; 

(b) determining whether said number of members exceeds 
a node capacity of a hierarchical index configured to 
index said data items; 

(c) determining a variance of values in one or more 
dimensions of said data items; 

(d) identifying a first dimension of said one or more multi 
dimensions in which to divide said set of multi- 
dimensional data items; 

(e) sorting said data items in said first dimension; 

(£) dividing said sorted data items in said first dimension 
into two or more subsets by: 

(f ) determining whether said number of members is 
less than double said node capacity; and 

(f") calculating the floor of the quotient of said node 
capacity divided by two; 

(g) repeating (a)-{f) for each said subset in order to divide 
said set of data items into a plurality of data item 
clusters, wherein each cluster comprises a number of 
data items no greater than said node capacity; and 

(h) configuring a leaf node of said hierarchical index for 
data items in a first cluster. 

3. The computer-implemented method of claim 2, 
wherein said identifying comprises selecting a dimension, 
from said one or more dimensions, having the greatest 
variance of values. 

4. The computer-implemented method of claim 2, in 
which said configuring comprises: 

creating a first leaf node of the index; and 
inserting an entry in said first leaf node for each data item 
in a first subset of said set of multi-dimensional data 
items, wherein a first entry for a first data item com- 
prises: 

an identifier of said first data item; and 

a bounding area encompassing said first data item. 

5. The computer-implemented method of claim 2, 
wherein the dimensions of said multi-dimensional data are 
inherently related. 

6. The computer-implemented method of claim 2, 
wherein the dimensions of the multi-dimensional data are 
independent attributes. 

7. The computer-implemented method of claim 2, further 
comprising identifying a query pattern for retrieving one or 
more data items from said set of data items, wherein said 
query pattern comprises a hierarchy of two or more dimen- 
sions of said multi-dimensional data items. 

8. The computer-implemented method of claim 7, which 
said identifying comprises selecting a dimension in said 
hierarchy of two or more dimensions. 

9. A computer readable storage medium storing instruc- 
tions that, when executed by a computer, cause the computer 
to perform a method of constructing a hierarchical index 
from a set of multi-dimensional data, the method compris- 
ing: 

(a) calculating a number of members of a set of multi- 
dimensional data items; 
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(b) determining whether said number of members exceeds 
a node capacity of a hierarchical index configured to 
index said data items; 

(c) determining a variance of values in one or more 
dimensions of said data items; 

(d) identifying a first dimension of said one or more multi 
dimensions in which to divide said set of multi- 
dimensional data items; 

(e) sorting said data items in said first dimension; 

(f) dividing said sorted data items in said first dimension 
into two or more subsets by: 

(f ) determining whether said number of members is 
greater than or equal to double said node capacity; 
and 

(f ') calculating the value of said node capacity times 
(the floor of (the sum of 0.5 plus (the quotient of said 
number of members divided by twice said node 
capacity))); 

(g) repeating (a)-(f) for each said subset in order to divide 
said set of data items into a plurality of data item 
clusters, wherein each cluster comprises a number of 
data items no greater than said node capacity; and 

(h) configuring a leaf node of said hierarchical index for 
data items in a first cluster. 

10. A computer-implemented method of constructing a 
hierarchical index from a set of multi-dimensional data, 
comprising: 

(a) calculating a number of members of a set of multi- 
dimensional data items; 

(b) determining whether said number of members exceeds 
a node capacity of a hierarchical index configured to 
index said data items; 

(c) determining a variance of values in one or more 
dimensions of said data items; 

(d) identifying a first dimension of said one or more multi 
dimensions in which to divide said set of multi- 
dimensional data items; 

(e) sorting said data items in said first dimension; 

(f) dividing said sorted data items in said first dimension 
into two or more subsets by: 

(f ) determining whether said number of members is 
greater than or equal to double said node capacity; 
and 

(f ') calculating the value of said node capacity times 
(the floor of (the sum of 0.5 plus (the quotient of said 
number of members divided by twice said node 
capacity))); 

(g) repeating (a)-(f) for each said subset in order to divide 
said set of data items into a plurality of data item 
clusters, wherein each cluster comprises a number of 
data items no greater than said node capacity; and 

(h) configuring a leaf node of said hierarchical index for 
data items in a first cluster. 

11. The computer-implemented method of claim 10, 
wherein said identifying comprises selecting a dimension, 
from said one or more dimensions, having the greatest 
variance of values. 

12. The computer-implemented method of claim 10, in 
which said configuring comprises: 

creating a first leaf node of the index; and 
inserting an entry in said first leaf node for each data item 
in a first subset of said set of multi-dimensional data 
items, wherein a first entry for a first data item com- 
prises: 
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an identifier of said first data item; and 

a bounding area encompassing said first data item. 

13. The computer-implemented method of claim 10, 
wherein the dimensions of said multi-dimensional data are 
inherently related. 

14. The computer-implemented method of claim 10, 
wherein the dimensions of the multi -dimensional data are 
independent attributes. 

15. The computer-implemented method of claim 10, fur- 
ther comprising identifying a query pattern for retrieving 
one or more data items from said set of data items, wherein 
said query pattern comprises a hierarchy of two or more 
dimensions of said multi-dimensional data items. 

16. The computer-implemented method of claim 15, in 
which said identifying comprises selecting a dimension in 
said hierarchy of two or more dimensions. 

17. A computer readable storage medium storing instruc- 
tions that, when executed by a computer, cause the computer 
to perform a method of indexing a set of multi-dimensional 
data items, the method comprising: 

sorting a set of multi-dimensional data items in a first 
dimension, said first dimension corresponding to a 
primary dimension of a first query pattern; 

dividing said sorted set of multi-dimensional data items in 
said first dimension into two or more first-order subsets 
of multi-dimensional data items; 

sorting a first first-order subset of said two or more 
first-order subsets of multi-dimensional data items in a 
second dimension, said second dimension correspond- 
ing to a secondary dimension of said first query pattern; 

dividing said first first-order subset of multi-dimensional 
data items in said second dimension into two or more 
second -order subsets of multi-dimensional data items; 
and 

storing the members of an N-order subset of multi- 
dimensional data items in a leaf node of a hierarchical 
index, wherein the number of data items in said N-order 
subset is less than or equal to a node capacity of said 
hierarchical index and N is greater than or equal to two. 

18. A computer-implemented method of indexing a set of 
multi-dimensional data items, comprising: 

sorting a set of multi-dimensional data items in a first 
dimension, said first dimension corresponding to a 
primary dimension of a first query pattern; 

dividing said sorted set of multi-dimensional data items in 
said first dimension into two or more first-order subsets 
of multi-dimensional data items; 

sorting a first first-order subset of said two or more 
first-order subsets of multi-dimensional data items in a 
second dimension, said second dimension correspond- 
ing to a secondary dimension of said first query pattern; 

dividing said first first-order subset of multi-dimensional 
data items in said second dimension into two or more 
second -order subsets of multi-dimensional data items; 
and 

storing the members of an N-order subset of multi- 
dimensional data items in a leaf node of a hierarchical 
index, wherein the number of data items in said N-order 
subset is less than or equal to a node capacity of said 
hierarchical index and N is greater than or equal to two. 

19. The method of claim 18, wherein said first dimension 
is a hierarchical attribute of said data items. 

20. The method of claim 18, further comprising storing 
said leaf node in a relational database. 

21. The method of claim 18, wherein said sorting a set of 
multi-dimensional data items comprises: 
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determining a variance of said multi-dimensional data 
items in one or more dimensions of said multi- 
dimensional data items; and 

selecting a first dimension of said multi -dimensional data 
items having a greater variance than a second dimen- 5 
sion of said multi-dimensional data items. 

22. The method of claim 18, in which said dividing said 
sorted set of multi-dimensional data items in said first 
dimension comprises: 

determining the number of data items in said set of 10 

multi-dimensional data items; 
calculating an intermediate value in said first dimension; 

and 

selecting a first data item corresponding to said interme- 15 
diate value. 

23. The method of claim 22, wherein said intermediate 
value is an approximate median value. 

24. The method of claim 22, in which said calculating an 
intermediate value comprises: 2 o 

determining whether said number of data items is less 
than twice the capacity of a node of said hierarchical 
index; and 

calculating the floor of the quotient of said node capacity 
divided by two. 25 

25. The method of claim 22, in which said calculating an 
intermediate value comprises: 
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determining whether said number of data items is greater 
than or equal to twice the capacity of a node of said 
hierarchical index; and 

calculating the value of said node capacity multiplied by 
(the floor of (the sum of 0.5 plus (the quotient of said 
number of data items divided by twice said node 
capacity))). 

26. The computer-implemented method of claim 18, fur- 
ther comprising: 

storing an identifier of a root node of said hierarchical 
index; 

storing a node capacity of said hierarchical index; and 
storing a measure of the dimensionality of said multi- 
dimensional data items. 

27. The computer-implemented method of claim 18, 
wherein said storing the members of an N-order subset of 
multi-dimensional data items comprises: 

creating a first leaf node of said hierarchical index; and 
inserting an entry in said first leaf node for each said 
member of said N-order subset of multi-dimensional 
data items, wherein each said entry comprises: 
an identifier of said corresponding member; and 
a bounding area encompassing said corresponding 
member. 
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